Observation selection bias in contact prediction and its implications for structural bioinformatics

Authors

G. Orlando

D. Raimondi

W. F. Vranken

Abstract

Next-generation sequencing is dramatically increasing the number of known protein sequences, while the number of experimentally determined protein structures lags behind. Structural bioinformatics attempts to close this gap by developing approaches that predict structure-level characteristics for uncharacterized protein sequences, and most of these methods rely heavily on evolutionary information collected from homologous sequences. Here we show that there is a substantial observational selection bias in this approach: the predictions are validated on proteins with known structures from the PDB, but precisely those proteins have significantly more homologs available than less studied sequences randomly extracted from UniProt. Structural bioinformatics methods developed this way are thus likely to have overestimated performance; we demonstrate this for two contact prediction methods, whose performance drops by up to 60% when a more realistic amount of evolutionary information is taken into account. We provide NOUMENON, a bias-free dataset for the validation of contact prediction methods.
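One way to picture the "more realistic amount of evolutionary information" described above is to cap the number of homologs available for a well-studied protein before running a predictor. The following is a minimal sketch under that assumption, not the procedure used by the authors; the function name `subsample_alignment`, the `target_depth` parameter and the toy sequences are hypothetical and chosen only for illustration.

```python
import random


def subsample_alignment(alignment, target_depth, seed=0):
    """Keep the query (first row) plus a random subset of homologs.

    alignment    -- list of aligned sequences, query first (hypothetical input format)
    target_depth -- maximum number of rows to keep, query included
    seed         -- seed for the random generator, so the draw is reproducible
    """
    if not alignment:
        raise ValueError("alignment must contain at least the query sequence")
    if target_depth < 1:
        raise ValueError("target_depth must be at least 1")

    query, homologs = alignment[0], alignment[1:]
    # Never ask for more homologs than the alignment actually contains.
    keep = min(target_depth - 1, len(homologs))
    rng = random.Random(seed)
    return [query] + rng.sample(homologs, keep)


if __name__ == "__main__":
    # Toy alignment: a query and four homologs (made-up sequences).
    toy = ["ACDE", "ACDF", "ACEE", "AC-E", "GCDE"]
    shallow = subsample_alignment(toy, target_depth=3, seed=1)
    print(len(shallow), "rows kept, query is", shallow[0])
```

A predictor could then be evaluated on both the full and the reduced alignment to see how strongly its accuracy depends on alignment depth; how the target depth should be chosen is left open here.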


Similar resources

BIOINFORMATICS Prediction Error Estimation: A Comparison of Resampling Methods

Motivation: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection, and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the ’...


SurvJamda: an R package to predict patients' survival and risk assessment using joint analysis of microarray gene expression data

SurvJamda (Survival prediction by joint analysis of microarray data) is an R package that utilizes joint analysis of microarray gene expression data to predict patients' survival and risk assessment. Joint analysis can be performed by merging datasets or meta-analysis to increase the sample size and to improve survival prognosis. The prognosis performance derived from the combined da...


Automated benchmarking of peptide-MHC class I binding predictions

Motivation: Numerous in silico methods predicting peptide binding to major histocompatibility complex (MHC) class I molecules have been developed over the last decades. However, the multitude of available prediction tools makes it non-trivial for the end-user to select which tool to use for a given task. To provide a solid basis on which to compare different prediction tools, we here describe a ...


Addition of Contact Number Information Can Improve Protein Secondary Structure Prediction by Neural Networks

Prediction of protein secondary structures is one of the oldest problems in bioinformatics. Although several different methods have been proposed to tackle this problem, none of these methods is perfect. Recently, it has been proposed that the addition of other structural information, such as accessible surface area of residues or prior information about protein structural class, can significantly improve th...


SVScore: an impact prediction tool for structural variation

Summary: Here we present SVScore, a tool for in silico structural variation (SV) impact prediction. SVScore aggregates per-base single nucleotide polymorphism (SNP) pathogenicity scores across relevant genomic intervals for each SV in a manner that considers variant type, gene features and positional uncertainty. We show that the allele frequency spectrum of high-scoring SVs is strongly skewed t...



Volume 6

Publication date: 2016
